Allow PTransforms to be applied directly to dataframes. #25919
R: @damccorm
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control
def _maybe_elementwise_or(self, right):
  if isinstance(right, PTransform):
    return convert.to_dataframe(convert.to_pcollection(self) | right)
I think this is fine. It is worth calling out that we're opening users up to doing an inefficient thing where they go:
df = convert.to_dataframe(pc)
df2 = df | MyDfOperation()
df3 = df2 | MySchemaTransform() | MySchemaTransform2() | MySchemaTransform3()
result = convert.to_dataframe(df3)
which would be (much?) more efficient written as:
df = convert.to_dataframe(pc)
df2 = df | MyDfOperation()
result = convert.to_dataframe(convert.to_pcollection(df2) | MySchemaTransform() | MySchemaTransform2() | MySchemaTransform3())
because this avoids the repeated to_dataframe/to_pcollection transition. The former is probably more natural though, especially if you have real df operations mixed in there, even if it's less efficient. I think the user experience still trumps the efficiency loss, but it might be something we want to document.
Added a note. Once we push the batching stuff through, there could be little to no overhead here, but that's future work.
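The trade-off discussed in this thread can be sketched with a toy model in plain Python. This is not Beam code: `ToyFrame`, `ToyCollection`, `ToyTransform`, and the `CONVERSIONS` counter are hypothetical stand-ins for a deferred dataframe, a PCollection, a PTransform, and the conversion work. Only the shape of the `|` dispatch mirrors the diff above, and the real overhead in Beam may differ.

```python
# Toy model only: every name here is a hypothetical stand-in, not a Beam API.
CONVERSIONS = {"to_pcollection": 0, "to_dataframe": 0}


class ToyTransform:
    """Stand-in for a PTransform: wraps a function over a list of rows."""

    def __init__(self, fn):
        self.fn = fn


class ToyCollection:
    """Stand-in for a PCollection."""

    def __init__(self, rows):
        self.rows = list(rows)

    def __or__(self, transform):
        return ToyCollection(transform.fn(self.rows))


def to_pcollection(frame):
    CONVERSIONS["to_pcollection"] += 1
    return ToyCollection(frame.rows)


def to_dataframe(pcoll):
    CONVERSIONS["to_dataframe"] += 1
    return ToyFrame(pcoll.rows)


class ToyFrame:
    """Stand-in for a deferred dataframe."""

    def __init__(self, rows):
        self.rows = list(rows)

    def __or__(self, right):
        # Same shape as the diff: a transform on the right-hand side
        # triggers a frame -> collection -> frame round trip.
        if isinstance(right, ToyTransform):
            return to_dataframe(to_pcollection(self) | right)
        return NotImplemented


double = ToyTransform(lambda rows: [r * 2 for r in rows])
increment = ToyTransform(lambda rows: [r + 1 for r in rows])

# Chaining directly on the frame: one round trip per transform.
chained = ToyFrame([1, 2, 3]) | double | increment | double
chained_conversions = dict(CONVERSIONS)

for key in CONVERSIONS:
    CONVERSIONS[key] = 0

# Converting once and chaining on the collection: one round trip in total.
single = to_dataframe(
    to_pcollection(ToyFrame([1, 2, 3])) | double | increment | double)
single_conversions = dict(CONVERSIONS)
```

Both forms produce the same rows in this model. The chained form records three conversions in each direction and the single-conversion form records one, which is the efficiency gap the review comment describes.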
Codecov Report
@@            Coverage Diff             @@
##           master   #25919      +/-   ##
==========================================
- Coverage   71.41%   71.41%   -0.01%
==========================================
  Files         778      778
  Lines      102420   102441      +21
==========================================
+ Hits        73146    73156      +10
- Misses      27818    27829      +11
  Partials     1456     1456
... and 12 files with indirect coverage changes
Just checking whether "I think this is fine" is an LGTM.
Just checking whether "I think this is fine" is an LGTM.
LGTM. I did want to block on the proto changes, which is why I didn't give the explicit LGTM. Sorry for the ambiguity.
No problem. And, yes, the proto changes were merged separately.
This'll be especially handy for applying RunInference.